Fix stateless HTTP task accumulation causing memory leak #2145

Open
wiggzz wants to merge 15 commits into modelcontextprotocol:main from wiggzz:fix/stateless-task-group-leak

Conversation


@wiggzz wiggzz commented Feb 25, 2026

Summary

  • Fix memory leak where run_stateless_server tasks accumulate indefinitely in the manager's global task group after client disconnects
  • Use request-scoped task groups instead of the global _task_group for stateless requests, ensuring tasks are cancelled when requests complete
  • Add graceful SSE drain on shutdown: terminate all active transports before cancelling the task group, so SSE streams close cleanly instead of being abruptly reset
  • Add regression tests for both the leak and the graceful drain

Motivation and Context

Related to #1764 (the underlying task accumulation issue in stateless mode).
Fixes #1272 (graceful shutdown / SSE drain).

Root cause — memory leak

When a client disconnects while a tool call is in progress, EventSourceResponse detects the disconnect and tears down the SSE stream pipeline. However, Server.run() — which is processing the tool call in a separate task spawned into the manager's global _task_group — continues running. After the tool handler completes, the response has nowhere to go (the request streams were cleaned up), and Server.run() blocks indefinitely on async for message in session.incoming_messages. These zombie tasks accumulate one per disconnected request.

In production, we observed 449 leaked ServerSession objects, 66,086 orphaned coroutines, 4,463 blocked stream receive operations, and ~150 MB of leaked heap — all from this pattern.
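The zombie-task pattern described above can be reproduced with a small sketch. Stdlib asyncio stands in for the SDK's anyio task group here, and `run_stateless_server` is reduced to its blocking behavior; all names are illustrative, not the SDK's actual code:

```python
import asyncio


async def run_stateless_server(active: set, task_id: int) -> None:
    # Stand-in for Server.run(): after the tool handler completes, it blocks
    # forever on `async for message in session.incoming_messages`, because
    # the disconnected client's request streams were already torn down.
    active.add(task_id)
    try:
        await asyncio.Event().wait()  # a message that will never arrive
    finally:
        active.discard(task_id)


async def demo() -> int:
    active: set = set()
    # Stand-in for the manager's long-lived global _task_group: each
    # disconnected request leaves one task permanently parked in it.
    tasks = [asyncio.create_task(run_stateless_server(active, i)) for i in range(3)]
    await asyncio.sleep(0.05)
    leaked = len(active)  # all three tasks are still blocked: zombies
    for t in tasks:       # only full manager shutdown ever frees them
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return leaked


leaked = asyncio.run(demo())
print(leaked)  # 3 — one leaked task per disconnected request
```

Each request adds one permanently blocked task; nothing in the request lifecycle ever reaps it, which is the accumulation the production numbers reflect.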

Root cause — broken SSE close on shutdown

The memory leak fix moves stateless server tasks from the manager's global task group to request-scoped task groups. This fixes the leak, but it also means the manager's tg.cancel_scope.cancel() during shutdown no longer reaches stateless transports — nothing ever closes their streams, so SSE responses hang indefinitely. A reverse proxy like nginx would log "upstream prematurely closed connection while reading upstream".

The graceful drain fix adds explicit terminate() calls in run()'s finally block (for both stateful and stateless transports) before cancelling the task group. This closes the in-memory streams, letting EventSourceResponse send a final more_body=False chunk — a clean HTTP close.

Why the existing test didn't catch the leak

The existing test (test_stateless_requests_memory_cleanup, added in PR #1140) mocks app.run to return immediately, so the spawned tasks complete naturally. It never exercises the real-world scenario where client disconnects leave app.run() blocked.

The fixes

Memory leak: Replace the global task group spawn with a request-scoped anyio.create_task_group(). Both the server task and the request handler run inside this scoped group. When the request handler finishes (or is cancelled due to client disconnect), the scope is cancelled, which automatically cancels the server task. No tasks leak into the manager's long-lived task group.

Graceful drain: Track in-flight stateless transports in _stateless_transports. In run()'s finally block, call terminate() on all active transports (stateful via _server_instances, stateless via _stateless_transports) before tg.cancel_scope.cancel().
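The drain logic can be sketched as follows, with a hypothetical `FakeTransport` standing in for the real transport class and simplified container types:

```python
import asyncio


class FakeTransport:
    # Minimal stand-in for the streamable-HTTP transport (names hypothetical).
    def __init__(self, name: str, closed: list) -> None:
        self.name = name
        self._closed = closed

    async def terminate(self) -> None:
        # In the real transport this closes the in-memory streams, letting
        # EventSourceResponse emit a final more_body=False chunk.
        self._closed.append(self.name)


async def shutdown(server_instances: dict, stateless_transports: set) -> None:
    # Sketch of run()'s finally block after the fix: drain every active
    # transport, stateful and stateless, before cancelling the task group.
    for transport in list(server_instances.values()):
        await transport.terminate()
    for transport in list(stateless_transports):
        await transport.terminate()
    # ...followed by tg.cancel_scope.cancel() in the real manager.


closed: list = []
stateful = {"session-1": FakeTransport("session-1", closed)}
stateless = {FakeTransport("stateless-1", closed)}
asyncio.run(shutdown(stateful, stateless))
print(sorted(closed))  # ['session-1', 'stateless-1']
```

Ordering is the point: terminating first gives each SSE pipeline a chance to finish its stream cleanly before cancellation tears everything down.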

Testing

  • test_stateless_requests_task_leak_on_client_disconnect: Real Server.run() with a real tool handler, real SSE streaming, simulated client disconnect — verifies zero leaked tasks after 3 disconnected requests
  • test_graceful_shutdown_closes_sse_streams_cleanly: Uses an ASGI send() interceptor to verify that SSE responses receive more_body=False (the ASGI equivalent of a proper HTTP chunked transfer close) during manager shutdown. Without the fix, the test times out because nothing ever closes the stateless transport's streams.
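The send-interceptor technique used by the shutdown test can be sketched like this; the ASGI app below is illustrative, not the SDK's actual transport:

```python
import asyncio


async def app(scope, receive, send):
    # Minimal ASGI app standing in for a streamable-HTTP SSE response.
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/event-stream")],
    })
    await send({"type": "http.response.body", "body": b"data: hi\n\n",
                "more_body": True})
    # A graceful drain ends the stream with a final empty chunk:
    await send({"type": "http.response.body", "body": b"", "more_body": False})


async def demo() -> bool:
    events = []

    async def intercepting_send(message):  # the test's send() interceptor
        events.append(message)

    await app({"type": "http"}, None, intercepting_send)
    # Clean close iff some body message carries more_body=False.
    return any(
        m["type"] == "http.response.body" and not m.get("more_body", True)
        for m in events
    )


closed_cleanly = asyncio.run(demo())
print(closed_cleanly)  # True: the SSE stream closed cleanly
```

Without the fix, the final `more_body=False` message never arrives, so the real test's wait for it times out.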

All 13 tests pass.

Breaking changes

None. This is an internal implementation change. The public API and behavior are unchanged.

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

@wiggzz wiggzz force-pushed the fix/stateless-task-group-leak branch 2 times, most recently from 67d02d3 to a7a340d, on February 25, 2026 at 17:00
@maxisbey maxisbey added the labels "bug" (Something isn't working) and "P2" (Moderate issues affecting some users, edge cases; potentially valuable feature) on Mar 5, 2026
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 6, 2026
Use request-scoped task groups for stateless requests instead of the
manager's global task group.  When a client disconnects while a tool
call is in progress, the request-scoped task group is cancelled,
preventing zombie tasks from accumulating.

In production, this caused 449 leaked ServerSession objects, 66,086
orphaned coroutines, and ~150 MB of leaked heap.

Upstream PR: modelcontextprotocol#2145
Github-Issue: modelcontextprotocol#1764
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 9, 2026
Use request-scoped task groups for stateless requests instead of
the manager's global task group, preventing zombie task accumulation
when clients disconnect mid-request.

Upstream PR: modelcontextprotocol#2145
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 9, 2026
wiggzz and others added 14 commits March 9, 2026 13:21
…tprotocol#756)

In stateless mode, each request spawned a `run_stateless_server` task
into the manager's global `_task_group`. After `handle_request()`
completed and `terminate()` was called, the task continued running
inside `app.run()`, blocked on `async for message in
session.incoming_messages`. These zombie tasks accumulated indefinitely,
leaking memory.

Replace the global task group spawn with a request-scoped task group
so that server tasks are automatically cancelled when their request
completes.

Add a regression test that simulates the real blocking behavior of
`app.run()` using `anyio.sleep_forever()` and verifies no tasks
linger in the global task group after requests finish.

Closes modelcontextprotocol#756

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntextprotocol#1764 (open)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add anyio.sleep(0) between terminate() and cancel_scope.cancel() so
  the message router can observe the ClosedResourceError from the closed
  streams before the task group is cancelled (restores line 1022 coverage)
- Remove handle_list_tools from test — it was never called (only
  call_tool is exercised), so its return was an uncovered line
- Mark the CallToolResult return as pragma: no cover — intentionally
  unreachable in this test because the task is always cancelled at
  await tool_gate.wait() before the return executes
- Collapse multi-line function signatures to satisfy ruff-format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) yield in the stateless request handler allows the event
loop to exercise exception paths that were previously unreachable with
mocked tests. These paths are legitimately reachable and the more
realistic test now covers them consistently:

- streamable_http_manager.py: except Exception in run_stateless_server
- streamable_http.py: except Exception in SSE pipeline (x2)
- streamable_http.py: if not self.mcp_session_id (always true in stateless mode)
- session.py: except Exception in receive loop cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) between terminate() and cancel_scope.cancel() was
non-deterministic: sometimes the event loop ran exception handlers
(covering the pragmas), sometimes the cancel won the race (leaving
lines uncovered). This caused strict-no-cover to flip between
"wrongly marked" and "missing coverage" across CI runs.

Revert to the simple, deterministic approach:
- Restore all original pragma: no cover annotations
- Add pragma: no cover for the ClosedResourceError/_terminated path
  in the message router, which is unreachable with request-scoped
  task groups (Cancelled always wins over ClosedResourceError)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The if-branch check itself is reachable; it's the body that isn't.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cancel the request-scoped task group first so the Cancelled exception
deterministically reaches the server task before terminate() closes the
streams. This avoids a race between Cancelled and ClosedResourceError
in the message router.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change 5 pragma annotations from strict "no cover" to "lax no cover"
since these exception handlers may or may not fire depending on
cancellation timing. Fix ruff-format in test file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark 4 test assertions as lax no cover — coverage.py on Python 3.11
falsely reports them as uncovered despite the test passing. Also fix
ruff format for assert message and async with blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add None check assertion and type: ignore for accessing private
_tasks attribute on TaskGroup (needed to verify no leaked tasks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous approach ran handle_request inside a child task within the
request-scoped task group. This caused coverage.py on Python 3.11 to
lose track of execution when returning from the task group back to the
caller — the root cause of the false coverage gaps on test assertions.

The fix is simpler: run handle_request directly in the async with body
(like the original code on main), with only run_stateless_server as a
child task. This preserves the request-scoped task group for preventing
task accumulation while keeping the caller's execution context intact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Terminate all active transports (stateful and stateless) before
cancelling the task group during shutdown.  This closes in-memory
streams so EventSourceResponse can send a final more_body=False
chunk — a clean HTTP close instead of a connection reset.

Without this, the request-scoped task groups introduced for the
memory leak fix isolate stateless transports from the manager's
cancel scope, so nothing ever closes their streams on shutdown.
@wiggzz wiggzz force-pushed the fix/stateless-task-group-leak branch from 92d4f67 to 3179143 on March 9, 2026 at 18:22
wiggzz added a commit to dbt-labs/mcp-python-sdk that referenced this pull request Mar 10, 2026
Request-scoped task groups for stateless requests prevent zombie task
accumulation when clients disconnect mid-request. Graceful drain on
shutdown terminates all active transports before cancelling the task
group, allowing SSE responses to close cleanly.

Backported to v1.26.0 for compatibility with dbt-mcp and ai-codegen-api.

Upstream PR: modelcontextprotocol#2145


Development

Successfully merging this pull request may close these issues.

Server hangs when shutting down if a connection is still open